Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower #33613

Isotr0py · 2024-09-20T08:21:05Z

What does this PR do?

Fixes # (issue)

Some fine-tuned LLaVA models, such as FreedomIntelligence/HuatuoGPT-Vision-7B, also use the LlavaQwen2ForCausalLM architecture, but with CLIP instead of SigLIP as the vision_tower.
However, the current LLaVA conversion script hardcodes SigLIP whenever the text model is Qwen, so converting such a checkpoint produces a model with the wrong vision tower config.
This PR adapts this part to make the script more flexible.
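To illustrate the idea, here is a minimal sketch of choosing the vision-tower settings from the vision model id rather than from the text model id. The helper name `pick_vision_settings` is hypothetical and this is not the code from the conversion script; the numeric values are copied from the converted configs printed later in this conversation.

```python
from typing import Any, Dict


def pick_vision_settings(vision_model_id: str) -> Dict[str, Any]:
    """Hypothetical helper: decide vision-tower settings from the vision model id."""
    if "siglip" in vision_model_id.lower():
        # Values as they appear in the converted llava-interleave config.
        return {
            "vision_config": {
                "hidden_size": 1152,
                "image_size": 384,
                "intermediate_size": 4304,
                "num_attention_heads": 16,
                "num_hidden_layers": 26,
                "patch_size": 14,
                "vision_use_head": False,
            },
            "vision_feature_layer": -1,
            "vision_feature_select_strategy": "full",
        }
    # Any other tower (for example CLIP): no explicit override here, so the
    # caller is assumed to keep its own default vision config.
    return {
        "vision_config": None,
        "vision_feature_layer": -2,
        "vision_feature_select_strategy": "default",
    }


if __name__ == "__main__":
    for model_id in ("google/siglip-so400m-patch14-384", "openai/clip-vit-large-patch14-336"):
        settings = pick_vision_settings(model_id)
        print(model_id, settings["vision_feature_layer"], settings["vision_feature_select_strategy"])
```

The point of the sketch is only that the branch depends on `vision_model_id`, so a Qwen text model paired with a CLIP tower no longer receives SigLIP settings.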

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@amyeroberts @zucchini-nlp
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

amyeroberts

Hi @Isotr0py, could you show the run, its output, and the successful conversion of a previous Qwen checkpoint with these changes?

Otherwise it looks OK to me, but @zucchini-nlp will know which checkpoints we should verify are still compatible with these updates.

Isotr0py · 2024-09-21T03:26:17Z

I printed the converted model and config for llava-interleave and LlavaQwen2. It seems that the previous Qwen checkpoint can still be converted.

llava-interleave

The convert command for llava-interleave:

python transformers/src/transformers/models/llava/convert_llava_weights_to_hf.py --text_model_id Qwen/Qwen1.5-7B-Chat --vision_model_id google/siglip-so400m-patch14-384 --old_state_dict_id lmms-lab/llava-next-interleave-qwen-7b

The converted model and config:

Fetching 4 files: 100% 4/4 [00:00<00:00, 34521.02it/s]
LlavaForConditionalGeneration(
  (vision_tower): SiglipVisionModel(
    (vision_model): SiglipVisionTransformer(
      (embeddings): SiglipVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
        (position_embedding): Embedding(729, 1152)
      )
      (encoder): SiglipEncoder(
        (layers): ModuleList(
          (0-25): 26 x SiglipEncoderLayer(
            (self_attn): SiglipSdpaAttention(
              (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
            )
            (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (mlp): SiglipMLP(
              (activation_fn): PytorchGELUTanh()
              (fc1): Linear(in_features=1152, out_features=4304, bias=True)
              (fc2): Linear(in_features=4304, out_features=1152, bias=True)
            )
            (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
          )
        )
      )
      (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
    )
  )
  (multi_modal_projector): LlavaMultiModalProjector(
    (linear_1): Linear(in_features=1152, out_features=4096, bias=True)
    (act): GELUActivation()
    (linear_2): Linear(in_features=4096, out_features=4096, bias=True)
  )
  (language_model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(152000, 4096)
      (layers): ModuleList(
        (0-31): 32 x Qwen2DecoderLayer(
          (self_attn): Qwen2SdpaAttention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=True)
            (k_proj): Linear(in_features=4096, out_features=4096, bias=True)
            (v_proj): Linear(in_features=4096, out_features=4096, bias=True)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
            (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
            (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((4096,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((4096,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((4096,), eps=1e-06)
    )
    (lm_head): Linear(in_features=4096, out_features=152000, bias=False)
  )
)

LlavaConfig {
  "ignore_index": -100,
  "image_token_index": 151646,
  "model_type": "llava",
  "projector_hidden_act": "gelu",
  "text_config": {
    "_name_or_path": "Qwen/Qwen1.5-7B-Chat",
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "bos_token_id": 151643,
    "eos_token_id": 151645,
    "intermediate_size": 11008,
    "max_position_embeddings": 32768,
    "max_window_layers": 28,
    "model_type": "qwen2",
    "rope_theta": 1000000.0,
    "sliding_window": null,
    "torch_dtype": "bfloat16",
    "use_sliding_window": false,
    "vocab_size": 152000
  },
  "transformers_version": "4.44.2",
  "vision_config": {
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "layer_norm_eps": 1e-06,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 26,
    "patch_size": 14,
    "vision_use_head": false
  },
  "vision_feature_layer": -1,
  "vision_feature_select_strategy": "full"
}
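As a quick sanity check on the output above, the size of `position_embedding` can be derived from `image_size` and `patch_size` in the printed `vision_config`. The sketch below uses a hypothetical helper, `expected_num_positions`, and assumes a non-overlapping patch grid with an optional class token; SigLIP uses no class token.

```python
def expected_num_positions(image_size: int, patch_size: int, class_token: bool) -> int:
    """Number of position embeddings for a square image split into square patches."""
    # A stride equal to the patch size with no padding keeps only whole patches.
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if class_token else 0)


# SigLIP tower from the config above: image_size=384, patch_size=14, no class token.
siglip_positions = expected_num_positions(384, 14, class_token=False)
print(siglip_positions)  # 729, matching Embedding(729, 1152) in the printed model
```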

`LlavaQwen2` with `Clip`

The convert command for LlavaQwen2 with Clip:

python transformers/src/transformers/models/llava/convert_llava_weights_to_hf.py --text_model_id Qwen/Qwen2-7B --vision_model_id openai/clip-vit-large-patch14-336 --old_state_dict_id FreedomIntelligence/HuatuoGPT-Vision-7B

The converted model and config:

LlavaForConditionalGeneration(
  (vision_tower): CLIPVisionModel(
    (vision_model): CLIPVisionTransformer(
      (embeddings): CLIPVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(577, 1024)
      )
      (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (encoder): CLIPEncoder(
        (layers): ModuleList(
          (0-23): 24 x CLIPEncoderLayer(
            (self_attn): CLIPSdpaAttention(
              (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            )
            (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (mlp): CLIPMLP(
              (activation_fn): QuickGELUActivation()
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
            )
            (layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
      (post_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
  )
  (multi_modal_projector): LlavaMultiModalProjector(
    (linear_1): Linear(in_features=1024, out_features=3584, bias=True)
    (act): GELUActivation()
    (linear_2): Linear(in_features=3584, out_features=3584, bias=True)
  )
  (language_model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(152128, 3584)
      (layers): ModuleList(
        (0-27): 28 x Qwen2DecoderLayer(
          (self_attn): Qwen2SdpaAttention(
            (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
            (k_proj): Linear(in_features=3584, out_features=512, bias=True)
            (v_proj): Linear(in_features=3584, out_features=512, bias=True)
            (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
            (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
          (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        )
      )
      (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    )
    (lm_head): Linear(in_features=3584, out_features=152128, bias=False)
  )
)

LlavaConfig {
  "ignore_index": -100,
  "image_token_index": 151646,
  "model_type": "llava",
  "projector_hidden_act": "gelu",
  "text_config": {
    "_name_or_path": "Qwen/Qwen2-7B",
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "bos_token_id": 151643,
    "eos_token_id": 151643,
    "hidden_size": 3584,
    "intermediate_size": 18944,
    "max_position_embeddings": 131072,
    "max_window_layers": 28,
    "model_type": "qwen2",
    "num_attention_heads": 28,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "rope_theta": 1000000.0,
    "sliding_window": null,
    "torch_dtype": "bfloat16",
    "use_sliding_window": false,
    "vocab_size": 152128
  },
  "transformers_version": "4.44.2",
  "vision_config": {
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768,
    "vocab_size": 32000
  },
  "vision_feature_layer": -2,
  "vision_feature_select_strategy": "default"
}
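The printed shapes can be cross-checked against this config with a little arithmetic. The sketch below is only an illustration with hypothetical helper names: it derives the attention projection widths from `hidden_size`, `num_attention_heads` and `num_key_value_heads`, and the CLIP position count from `image_size` and `patch_size`, assuming one extra class-token position for CLIP.

```python
from typing import Tuple


def attention_projection_widths(hidden_size: int, num_heads: int, num_kv_heads: int) -> Tuple[int, int]:
    """Return (query width, key/value width) for grouped-query attention."""
    head_dim = hidden_size // num_heads
    return num_heads * head_dim, num_kv_heads * head_dim


def clip_num_positions(image_size: int, patch_size: int) -> int:
    """Patch positions plus one class-token position."""
    return (image_size // patch_size) ** 2 + 1


# Text config above: hidden_size=3584, num_attention_heads=28, num_key_value_heads=4.
q_width, kv_width = attention_projection_widths(3584, 28, 4)
print(q_width, kv_width)  # 3584 512, matching q_proj and k_proj/v_proj out_features

# Vision config above: image_size=336, patch_size=14.
print(clip_num_positions(336, 14))  # 577, matching Embedding(577, 1024)
```

Both results agree with the printed model, which is consistent with the CLIP vision tower and the Qwen2-7B text model having been wired up as intended.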

fix llavaqwen2 model conversion

dfe8241

amyeroberts reviewed Sep 20, 2024
